Execution Model

In Spark, there are three modes in which a job can be submitted:
  • Client mode
  • Cluster mode
  • Local mode
Client Mode: When running Spark in client mode, the SparkContext and Driver program run outside the cluster, i.e. the driver is instantiated on the machine where the job is submitted, for example your laptop. It is often used during development because the logs are displayed directly in the current terminal and the driver instance is tied to the user's session. This mode is not recommended in production because the Edge Node can quickly become saturated in terms of resources, and the Edge Node is a SPOF (Single Point Of Failure).

After the Executors are launched, they communicate directly with the Driver program (the SparkSession or SparkContext), and the output is returned directly to the client.

The drawback of Spark client mode with YARN is that the client machine must remain available for as long as the job is running: you cannot submit your job, turn off your laptop and leave the office before it finishes. Because the driver and the Spark infrastructure do not reside on the same machines, the chance of a network disconnection between them is higher; if that connection is broken, the Driver can no longer reach the Executors and the job fails without returning its output.
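As a concrete illustration, the minimal Scala sketch below (the object name and app name are made up for this example) simply prints the spark.driver.host property: in client mode it resolves to the submitting machine, in cluster mode to a node inside the cluster.

    import org.apache.spark.sql.SparkSession

    // Sketch only: the driver is whatever JVM creates the SparkSession.
    // In client mode that is the machine the job was submitted from.
    object WhereIsTheDriver {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("where-is-the-driver")   // master and deploy-mode come from spark-submit
          .getOrCreate()

        // spark.driver.host is filled in at runtime by the driver itself.
        println(s"Driver host: ${spark.conf.get("spark.driver.host")}")

        spark.stop()
      }
    }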

Cluster Mode: The difference in this mode is that the Driver runs inside the cluster, not on the local machine. This is the most common mode: the user sends a JAR file or a script to the Cluster Manager, which then instantiates a Driver and Executors on the different nodes of the cluster.

  • When working in cluster mode, all JARs related to the execution of your application need to be available to all the workers. You can either place them manually on shared storage or copy them into a folder on each of the workers; see the sketch after this list.
  • The Cluster Manager is responsible for all processes related to the Spark application. It handles the allocation of resources and releases them as soon as the application is finished.
  • The Driver runs on one of the cluster's Worker nodes, as a dedicated, standalone process inside that Worker.
  • Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This isolates applications from each other, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs).
  • In this mode, the chance of a network disconnection between the driver and the Spark infrastructure is reduced, since they reside in the same infrastructure, which also reduces the chance of job failure.
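Regarding the first bullet, one way to make the application's dependencies visible to every node is the spark.jars property, which ships the listed JARs to the driver and all executors. The sketch below is only an illustration; the HDFS paths are placeholders.

    import org.apache.spark.sql.SparkSession

    // Sketch only: the paths are placeholders. Putting dependency JARs on
    // shared storage such as HDFS keeps them reachable from every node,
    // which matters because in cluster mode the driver also runs remotely.
    val spark = SparkSession.builder()
      .appName("cluster-mode-deps")
      .config("spark.jars", "hdfs:///libs/my-udfs.jar,hdfs:///libs/my-formats.jar")
      .getOrCreate()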

Local Mode: The Driver and Executors run on the machine on which the user is logged in, within a single JVM. It is recommended only for testing an application in a local environment or for executing unit tests.
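A minimal local-mode sketch in Scala (the app name and the toy computation are illustrative), suitable for pasting into spark-shell or a unit test:

    import org.apache.spark.sql.SparkSession

    // Local mode: driver and executors share the current JVM.
    // "local[*]" uses as many worker threads as the machine has cores.
    val spark = SparkSession.builder()
      .appName("local-mode-test")
      .master("local[*]")
      .getOrCreate()

    // A tiny job to confirm the setup works end to end.
    val total = spark.sparkContext.parallelize(1 to 100).reduce(_ + _)
    println(total)  // 5050

    spark.stop()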


Configurations to run a Spark Job on a YARN cluster
  • master – Determines which cluster manager the job connects to; for a YARN cluster this is 'yarn'.
  • deploy-mode – 'cluster' was selected to run the SparkPi example above inside the cluster. To run the driver outside the cluster instead, select 'client'.
  • driver-memory – The amount of memory available to the driver process. In a YARN cluster configuration, the Application Master runs the driver.
  • executor-memory – The amount of memory allocated to each executor process.
  • executor-cores – The number of cores allocated to each executor process.
  • queue – The YARN queue on which this job will run. If you have not already defined queues on your cluster, it is best to use the 'default' queue. (The sketch after this list shows the configuration property that corresponds to each flag.)
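Each of the flags above also has a Spark configuration property. The Scala sketch below documents that mapping; the values are placeholders, and in practice these settings (deploy mode in particular) are normally passed to spark-submit rather than hard-coded.

    import org.apache.spark.SparkConf

    // Sketch: one configuration property per flag in the list above.
    val conf = new SparkConf()
      .setAppName("yarn-config-sketch")
      .set("spark.master", "yarn")                // --master
      .set("spark.submit.deployMode", "cluster")  // --deploy-mode
      .set("spark.driver.memory", "2g")           // --driver-memory
      .set("spark.executor.memory", "4g")         // --executor-memory
      .set("spark.executor.cores", "2")           // --executor-cores
      .set("spark.yarn.queue", "default")         // --queue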
